{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* By: Harkishan Singh Baniya\n",
    "* Email: harkishansinghbaniya@gmail.com\n",
    "* Reference: Advances in Financial Machine Learning, Chapter-09"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 9 : Hyper-Parameter Tuning with Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hyper-parameter tuning is an essential step in building Machine Learning algorithms. Although the ML model tuning process may seem to be no different for finance, but if not done properly the algorithm will likely to overfit and produce negative performance. As optimizing models in finance are prone to overfitting, we must consider some key points mentioned in the chapter. Some of the key takeaways from the chapter are mentioned at the end of this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EXERCISES"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "import random\n",
    "random.seed(0)\n",
    "\n",
    "import time\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy.stats import rv_continuous,kstest\n",
    "\n",
    "from sklearn.metrics import *\n",
    "from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n",
    "from sklearn.datasets import make_classification\n",
    "\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.ensemble import BaggingClassifier\n",
    "from mlfinlab.cross_validation import PurgedKFold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the function getTestData from Chapter 8, form a synthetic dataset of 10,000 observations with 10 features, where 5 are informative and 5 are noise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#This is the function used in Chapter 8.\n",
    "\n",
    "def get_test_data(n_features=40, n_informative=10, n_redundant=10, n_samples=10000):\n",
    "    #Generate a random dataset for a classification problem    \n",
    "    trnsX, cont = make_classification(n_samples=n_samples, n_features=n_features, \n",
    "                                      n_informative=n_informative, n_redundant=n_redundant, \n",
    "                                      random_state=0, shuffle=False) \n",
    "    df0 = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.Minute(), end=pd.datetime.today()).round('S')\n",
    "    trnsX = pd.DataFrame(trnsX, index=df0)\n",
    "    cont = pd.Series(cont, index=df0).to_frame('bin')\n",
    "    df0 = ['I_%s' % i for i in range(n_informative)] + ['R_%s' % i for i in range(n_redundant)]\n",
    "    df0 += ['N_%s' % i for i in range(n_features - len(df0))]\n",
    "    trnsX.columns = df0\n",
    "    cont['w'] = 1.0 / cont.shape[0]\n",
    "    cont['t1'] = pd.Series(cont.index, index=cont.index)\n",
    "    return trnsX, cont\n",
    "\n",
    "X, cont = get_test_data(n_features=10, n_informative=5, n_redundant=0, n_samples=10000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>I_0</th>\n",
       "      <th>I_1</th>\n",
       "      <th>I_2</th>\n",
       "      <th>I_3</th>\n",
       "      <th>I_4</th>\n",
       "      <th>N_0</th>\n",
       "      <th>N_1</th>\n",
       "      <th>N_2</th>\n",
       "      <th>N_3</th>\n",
       "      <th>N_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2020-06-17 15:21:10</th>\n",
       "      <td>2.105359</td>\n",
       "      <td>2.861661</td>\n",
       "      <td>0.104159</td>\n",
       "      <td>0.686149</td>\n",
       "      <td>1.369429</td>\n",
       "      <td>-0.868903</td>\n",
       "      <td>-1.297125</td>\n",
       "      <td>-0.160205</td>\n",
       "      <td>-0.481024</td>\n",
       "      <td>0.841338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-06-17 15:22:10</th>\n",
       "      <td>-0.330754</td>\n",
       "      <td>1.464379</td>\n",
       "      <td>-1.405119</td>\n",
       "      <td>0.396713</td>\n",
       "      <td>-1.722305</td>\n",
       "      <td>0.471952</td>\n",
       "      <td>-1.443687</td>\n",
       "      <td>-0.433773</td>\n",
       "      <td>0.123114</td>\n",
       "      <td>-0.102970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2020-06-17 15:23:10</th>\n",
       "      <td>-0.461334</td>\n",
       "      <td>-0.160432</td>\n",
       "      <td>-2.169501</td>\n",
       "      <td>-0.137535</td>\n",
       "      <td>0.398229</td>\n",
       "      <td>-0.278979</td>\n",
       "      <td>-1.860566</td>\n",
       "      <td>0.909540</td>\n",
       "      <td>-0.396742</td>\n",
       "      <td>2.455228</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          I_0       I_1       I_2       I_3       I_4  \\\n",
       "2020-06-17 15:21:10  2.105359  2.861661  0.104159  0.686149  1.369429   \n",
       "2020-06-17 15:22:10 -0.330754  1.464379 -1.405119  0.396713 -1.722305   \n",
       "2020-06-17 15:23:10 -0.461334 -0.160432 -2.169501 -0.137535  0.398229   \n",
       "\n",
       "                          N_0       N_1       N_2       N_3       N_4  \n",
       "2020-06-17 15:21:10 -0.868903 -1.297125 -0.160205 -0.481024  0.841338  \n",
       "2020-06-17 15:22:10  0.471952 -1.443687 -0.433773  0.123114 -0.102970  \n",
       "2020-06-17 15:23:10 -0.278979 -1.860566  0.909540 -0.396742  2.455228  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(a)**\n",
    "Use ```GridSearchCV``` on 10-fold CV to find the ```C, gamma``` optimal hyperparameters on a SVC with RBF kernel, where ```param_grid={'C':[1E-2,1E-1,1,10,100],'gamma':[1E-2,1E-1,1,10,100]}``` and the scoring function is ```neg_log_loss```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Settings\n",
    "cv_gen = PurgedKFold(n_splits=10, samples_info_sets=cont['t1'])\n",
    "param={'C':[1e-2,1e-1,1,10,100],'gamma':[1e-2,1e-1,1,10,100]}\n",
    "est = SVC(kernel='rbf', probability=True)\n",
    "scoring = 'neg_log_loss'\n",
    "\n",
    "#GridSearchCV\n",
    "gs_cv = GridSearchCV(estimator=est, param_grid=param, cv=cv_gen, scoring=scoring, iid=False, n_jobs=-1)\n",
    "#Run Grid Serach\n",
    "start = time.time()\n",
    "pipe_gs = gs_cv.fit(X, cont['bin'], sample_weight=cont['w'])\n",
    "end = time.time()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(b)**\n",
    "How many nodes are there in the grid?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The number nodes in the grid is 25\n"
     ]
    }
   ],
   "source": [
    "print('The number nodes in the grid is', len(param['C'])*len(param['gamma']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(c)**\n",
    "How many fits did it take to find the optimal solution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "15 fits taken to find the optimal solution\n"
     ]
    }
   ],
   "source": [
    "print(pipe_gs.best_index_ , 'fits taken to find the optimal solution')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(d)**\n",
    "How long did it take to find this solution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken to find the solution is 1481.3452606201172 secs\n"
     ]
    }
   ],
   "source": [
    "print(f'Time taken to find the solution is {end-start} secs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(e)**\n",
    "How can you access the optimal result?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=10, gamma=0.01, probability=True)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#The optimal result can be accesed in the following way\n",
    "pipe_gs.best_estimator_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(f)**\n",
    "What is the CV score of the optimal parameter combination?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The best CV score for GridSearchCV is 0.40730892035385785 log_loss\n"
     ]
    }
   ],
   "source": [
    "print(f'The best CV score for GridSearchCV is {abs(pipe_gs.best_score_)} log_loss')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(g)**\n",
    "How can you pass sample weights to the SVC?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=10, gamma=0.01, probability=True)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Sample weights can be passed in following way \n",
    "#First we get the best estimator (SVC)\n",
    "best_svc = pipe_gs.best_estimator_\n",
    "#Then we fit the SVC\n",
    "best_svc.fit(X, cont['bin'], sample_weight=cont['w'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.2 \n",
    "Using the same dataset from exercise 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(a)** Use ```RandomizedSearchCV``` on 10-fold CV to find the ```C, gamma``` optimal hyper-parameters on an SVC with RBF kernel, where ```param_distributions={'C':logUniform(a=1E-2,b=1E2),'gamma':logUniform(a=1E-2,b=1E2)},n_iter=25``` and ```neg_log_loss``` is the scoring function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# CODE SNIPPET 9.4 THE logUniform_gen CLASS\n",
    "class logUniform_gen(rv_continuous):\n",
    "    #Random numbers log-uniformly distributed between 1 and e\n",
    "    def _cdf(self,x):\n",
    "        return np.log(x/self.a)/np.log(self.b/self.a)\n",
    "    \n",
    "def logUniform(a=1,b=np.exp(1)):\n",
    "    '''\n",
    "    This function creates a uniformly distributed\n",
    "    random samples in a log-scale of a and b.\n",
    "    '''\n",
    "    return logUniform_gen(a=a,b=b,name='logUniform')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Setting Up the RandomSerachSV \n",
    "cv_gen = PurgedKFold(n_splits=10, samples_info_sets=cont['t1'])\n",
    "param_dist={'C':logUniform(a=1e-2,b=1e2),'gamma':logUniform(a=1e-2,b=1e2)}\n",
    "n_iter=25\n",
    "est = SVC(kernel='rbf', probability=True)\n",
    "scoring = 'neg_log_loss'\n",
    "#\n",
    "rs_cv = RandomizedSearchCV(estimator=est, param_distributions=param_dist, n_iter=n_iter, cv=cv_gen, \n",
    "                           scoring=scoring, iid=False, n_jobs=-1)\n",
    "#Run RandomSerachSV\n",
    "start = time.time()\n",
    "pipe_rs = rs_cv.fit(X, cont['bin'], sample_weight=cont['w'])\n",
    "end = time.time()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(b)**\n",
    "How long did it take to find this solution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken to find the solution is 1208.8149218559265 secs\n"
     ]
    }
   ],
   "source": [
    "print(f'Time taken to find the solution is {end-start} secs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(c)**\n",
    "Is the optimal parameter combination similar to the one found in exercise 1?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=16.761799747477177, gamma=0.027164728727124766, probability=True)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe_rs.best_estimator_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The optimal parameters ```{C, gamma}``` obtained from the RandomSearchCV are *different* from GridSearchCV. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(d)**\n",
    "What is the CV score of the optimal parameter combination? How does it\n",
    "compare to the CV score from exercise 1?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The best CV score for RandomSearchCV is 0.3622598406083555 log_loss\n"
     ]
    }
   ],
   "source": [
    "print(f'The best CV score for RandomSearchCV is {abs(pipe_rs.best_score_)} log_loss')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We obtain a better perofromance with RandomSearchCV resulting in a *lower* log_loss."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.3\n",
    "From exercise 1,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(a)** Compute the Sharpe ratio of the resulting in-sample forecasts, from point 1.a\n",
    "(see Chapter 14 for a definition of Sharpe ratio)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sharpe_ratio(y_true : np.array, y_pred : np.array) -> float:\n",
    "    \"\"\"\n",
    "    A function to generate sharpe ratio out of model prediction,\n",
    "    if the prediction is a 1 and the label is also 1 we consider this as gain ,\n",
    "    else if prediction is a 0 and the label is 1 or 0 we consider that no action is taken hence 0 gain (no action taken),\n",
    "    else if prediction is a 1 and the label is 0 we consider it as loss hence -1 gain.\n",
    "    \"\"\"\n",
    "    if len(y_true) == len(y_pred):\n",
    "        returns = []\n",
    "        for i in range(len(y_true)):\n",
    "            t = y_true[i]\n",
    "            p = y_pred[i]\n",
    "            if t == 1 and p == 1:\n",
    "                returns.append(1)\n",
    "            elif t == 0 and p == 1:\n",
    "                returns.append(-1)\n",
    "        return np.mean(returns)/np.std(returns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Getting the best SVC obtained from the RandomSearchCV with neg_log_loss scoring\n",
    "gs_svc = pipe_gs.best_estimator_\n",
    "#Then we fit the SVC\n",
    "gs_svc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])\n",
    "#Getting the in-sample predictions\n",
    "pred = gs_svc.predict(X.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sharpe ratio of the resulting in-sample forecasts is 0.322747512496089\n"
     ]
    }
   ],
   "source": [
    "SR = sharpe_ratio( cont['bin'].values, pred)\n",
    "print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(b)** Repeat point 1.a, this time with accuracy as the scoring function. Compute\n",
    "the in-sample forecasts derived from the hyper-tuned parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "scoring = 'accuracy'\n",
    "#GridSearchCV with accuracy scoring\n",
    "gs_cv_acc = GridSearchCV(estimator=est, param_grid=param, cv=cv_gen, scoring=scoring, iid=False, n_jobs=-1)\n",
    "pipe_gs_acc = gs_cv_acc.fit(X, cont['bin'], sample_weight=cont['w'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Getting the best SVC obtained from the RandomSearchCV with accuracy scoring\n",
    "gs_svc_acc = pipe_gs_acc.best_estimator_\n",
    "#Then we fit the SVC\n",
    "gs_svc_acc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])\n",
    "#Getting the in-sample predictions\n",
    "pred = gs_svc_acc.predict(X.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sharpe ratio of the resulting in-sample forecasts is 0.8748291920672977\n"
     ]
    }
   ],
   "source": [
    "SR = sharpe_ratio( cont['bin'].values, pred)\n",
    "print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(c)** What scoring method leads to higher (in-sample) Sharpe ratio?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy scoring leads to a higher in-sample Sharpe ratio, given that the sizes of all bets are equal (regardless of the forcast confidence)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.4\n",
    "From exercise 2,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(a)**\n",
    "Compute the Sharpe ratio of the resulting in-sample forecasts, from point 2.a."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Getting the best SVC obtained from the RandomSearchCV with neg_log_loss scoring\n",
    "rs_svc = pipe_rs.best_estimator_\n",
    "#Then we fit the SVC\n",
    "rs_svc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])\n",
    "#Getting the in-sample predictions\n",
    "pred = rs_svc.predict(X.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sharpe ratio of the resulting in-sample forecasts is 0.814227755072263\n"
     ]
    }
   ],
   "source": [
    "SR = sharpe_ratio( cont['bin'].values, pred)\n",
    "print(f'Sharpe ratio of the resulting in-sample forecasts is {SR}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(b)** Repeat point 1.a, this time with accuracy as the scoring function. Compute\n",
    "the in-sample forecasts derived from the hyper-tuned parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "scoring = 'accuracy'\n",
    "\n",
    "rs_cv_acc = RandomizedSearchCV(estimator=est, param_distributions=param_dist, n_iter=n_iter, cv=cv_gen, \n",
    "                               scoring=scoring, iid=False, n_jobs=-1)\n",
    "pipe_rs_acc = rs_cv_acc.fit(X, cont['bin'], sample_weight=cont['w'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Getting the best SVC obtained from the RandomSearchCV with accuracy scoring\n",
    "rs_svc_acc = pipe_rs_acc.best_estimator_\n",
    "#Then we fit the SVC\n",
    "rs_svc_acc.fit(X.values, cont['bin'].values, sample_weight=cont['w'])\n",
    "#Getting the in-sample predictions\n",
    "pred = rs_svc_acc.predict(X.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sharpe ratio of the resulting in-sample forecasts is 0.8405700524269539\n"
     ]
    }
   ],
   "source": [
    "SR_acc = sharpe_ratio( cont['bin'].values, pred)\n",
    "print(f'Sharpe ratio of the resulting in-sample forecasts is {SR_acc}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(c)** What scoring method leads to higher (in-sample) Sharpe ratio?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The neg_log_loss scoring method produces a  better Sharp Ratio ,given that all bet sizes are equal regardless of confidence (probability)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.5\n",
    "Read the definition of log loss, ```L [Y, P]```."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(a)** Why is the scoring function ```neg_log_loss``` defined as the negative log loss, ```−L [Y, P]```?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The intuition behind changing the sign of the los_loss function is that we want to maximize the negative log_loss which will return a lower log_loss value. Basically, sklearn tends to maximize a function while optimizing a model (like maximizing the accuracy). The reason behind using the negative log_loss (instead of using accuracy) of hyper-parameter optimization is completely due to the fact that we are optimizing a model of an investment strategy *(see 9.4 page 134 of AFML for more details).*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(b)** What would be the outcome of maximizing the log loss, rather than the negative\n",
    "log loss?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will result quite the opposite of what we wanted. If we search for the parameters that will result in the highest log_loss possible, we will end up getting the worst combinations of parameters. (to better understand log_loss *ref : https://www.kaggle.com/dansbecker/what-is-log-loss*)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 9.6\n",
    "Consider an investment strategy that sizes its bets equally, regardless of the forecast’s\n",
    "confidence. In this case, what is a more appropriate scoring function for\n",
    "hyper-parameter tuning, accuracy or cross-entropy loss?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Log loss aka cross-entropy loss takes the confidence of a prediction in account while scoring a prediction. There may be times when a classifier may predict signal with low confidence and it results in a gain, also sometimes the classifier predicts signal with high confidance and it results a loss. So,during this scenario the cross-entropy loss will not offset the loss from the high confidance prediction *(see page 134 of AFML)*. <br>\n",
    "But on the other hand we can offset a miss with high probability with a hit with low probability and it does consider the confidance of predictions. Since the investment strategy doesn't consider the confidence of a prediction we can consider using the **accuracy** for scoring."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Key Takeaways** : - <br>\n",
    "* Use PurgedKFold class as CV generator while tuning in order to prevent overfits of the ML estimator to leaked information. ( 9.2 ; 2nd paragraph)\n",
    "* Use scoring='f1' in the context of meta-labeling applications. (9.2 ; 3rd paragraph)\n",
    "* Use neg_log_loss when you are tuning hyper-parameters for an investment strategy as it can account for the probability of hit and miss effectively than accuracy scoring. (9.4 ; 1st paragraph)\n",
    "* Sampling from a uniform distribution would be inefficient ; e.g. if we sample a parameter from a uniform distribution  ```U[0,100]``` , 99% of the values would be expected to be greater than 1; which is not the effective way to  exploring the feasibility region of parameters. Hence the author suggests using log-uniform distribution. (9.3.1)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mlfinlab",
   "language": "python",
   "name": "mlfinlab"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
